#
# # Specify custom path to dataset here:
# path_to_file <- "/Users/lingjong/Documents/CABDSS_Assignments/datasets/regions.csv"
# regions_df <- read.csv(path_to_file)
# head(regions_df)
library(dplyr)
library(factoextra)
library(caret)
library(rpart)Assignment (un)supervized machine learning
1 Unsupervized machine learning
Download the ‘regions.zip’ file from the ‘datasets’ chapter and save it in the directory of this file (“Assignment_ML.qmd”).
Variables:
- freehms: Gays and lesbians free to live life as they wish (1: Agree strongly; 5: Disagree strongly)
- hmsacld: Gay and lesbian couples right to adopt children (1: Agree strongly; 5: Disagree strongly)
- hmsfmlsh: Ashamed if close family member gay or lesbian (1: Agree strongly; 5: Disagree strongly)
- eqmgmbg: Bad or good for businesses in [country] if equal numbers of women and men are in higher management positions (0: Very bad; 6: Very good)
- eqpolbg: Bad or good for politics in [country] if equal numbers of women and men are in positions of political leadership (0: Very bad; 6: Very good)
- eqwrkbg: Bad or good for family life in [country] if equal numbers of women and men are in paid work (0: Very bad; 6: Very good)
- eqpaybg: Bad or good for economy in [country] if women and men receive equal pay for doing the same work (0: Very bad; 6: Very good)
- ppltrst: Most people can be trusted or you can’t be too careful (0: You can’t be too careful; 10: Most people can be trusted)
- pplfair: Most people try to take advantage of you, or try to be fair (0: Most people try to take advantage of me; 10: Most people try to be fair)
- pplhlp: Most of the time people helpful or mostly looking out for themselves (0: People mostly look out for themselve; 10: People mostly try to be helpful)
- trstprl: Trust in country’s parliament (0: No trust at all; 10: Complete trust)
- trstlgl: Trust in the legal system (0: No trust at all; 10: Complete trust)
- trstplc: Trust in the police (0: No trust at all; 10: Complete trust)
- trstplt: Trust in politicians (0: No trust at all; 10: Complete trust)
- trstprt: Trust in political parties (0: No trust at all; 10: Complete trust)
- regunit & region: region stipulated the region is which the respondent lives. regunit is the nuts level of this region.
library(tidyverse)
unzip("regions.zip") |>
read_delim(col_names=TRUE,
delim=",",
progress=FALSE,
show_col_types = FALSE,
locale = readr::locale(encoding = "latin1")) |>
mutate(across(c(freehms,hmsacld,hmsfmlsh), ~ifelse(. > 5,NA,.)), #(1)
across(c(eqmgmbg,eqpolbg,eqwrkbg,eqpaybg), ~ifelse(. > 6,NA,.)), #(1)
across(c(trstprl,trstlgl,trstplc,trstplt,trstprt), ~ifelse(. > 10,NA,.))) |> #(1)
mutate(across(c(freehms,hmsacld),~ 6 - .)) |> #(2)
group_by(regunit,region) |>
mutate(n_resp = n()) |> #(3)
pivot_longer(-c(regunit,region,n_resp),names_to = 'var',values_to = 'val') |> #(4)
drop_na() |> #(5)
group_by(regunit,region,var,n_resp) |>
summarise(mu = mean(val,na.rm=TRUE),
.groups='drop') |>
pivot_wider(names_from = var,values_from = mu) -> #(6)
df_regions1.1 Data wrangling
- Consider the code above and explain the lines that are indicated with (1) in your own words.
Answer:
These lines make use of the mutate() method, along with the ifelse() method, which is applied only to the vector of dataset columns provided in the across() function as arguments, for each line. While the mutate() function’s goal is to transform columns of the dataset, the across() function allows one to apply a function to a subset of these columns. Note that the tilde (“~”) refers to some anonymous function and the period (“.”) refers to the input column. The ifelse() method then checks a condition (e.g. . > 5, meaning “is the value strictly greater than 5?”) and assigns NA if the condition is satisfied (TRUE) or leaves the value as it is otherwise.
- Consider the code above and explain the lines that are indicated with (2) in your own words.
Answer:
This line applies a transformation on the columns reflecting sympathy for homosexual rights (freehms,hmsacld), which take values that are counter-intuitive to interpret (e.g.: 1= “Strongly Agree”; 5= “Strongly Disagree”). The transformation thus consists in subtracting the original value from 6, which upends the order of sympathy to make it more intuitive to interpret (e.g.: 5= “Strongly Agree”; 1= “Strongly Disagree”)). The line concludes with a trailing pipe operator, which directs output to the following statement in the code (namely, the group_by() function).
- Consider the code above and explain the lines that are indicated with (3) in your own words.
Answer:
After being organized in groups with group_by(), which groups the data by its geographic features (regunit, region), a new column, n_resp, is added to the dataset with mutate(). n_resp thus holds the number of individual respondents per region. The output is once again piped into the pivot_longer() method.
- Consider the code above and explain the lines that are indicated with (4) in your own words.
Answer:
pivot_longer() reshapes the structure of the dataframe, from “wide” to “long”. The vector prepended with a negative sign indicates the columns which are not subject to this alteration. All other columns are thus collapsed into 2 columns: var and val. var takes on as values the name of the variable, and var takes its corresponding value. This expectedly reduces the horizontal length of the dataset and extends its vertical length. Once again, the output is piped into the next methof, drop_na().
- Consider the code above and explain the lines that are indicated with (5) in your own words.
Answer:
drop_na() merely removes the entries of the resulting dataframe which contain missing values (NA). This only retains valid responses from the survey.
- Consider the code above and explain the lines that are indicated with (6) in your own words.
Answer:
Finally, the data is subjected to grouping once again (in the order provided: regunit,region,var,n_resp), and uses the summary() method to compute the mean (mu) for each variable for each region. After this is done, the summary statistic (=mean) that was yielded is ungrouped with .groups = ‘drop’: this gives a single summary statistic for each entry of the group. Ungrouping is absolutely necessary for the following step which pivots the dataframe back to a wider format (errors would occur otherwise). The dataframe is, as we’ve just announced, ultimately pivoted back to a wider form with pivot_wider(), and the variable names listed in var become once again column names.
1.2 Principle component analysis
head(df_regions)# A tibble: 6 × 15
regunit region n_resp eqmgmbg eqpaybg eqpolbg eqwrkbg freehms hmsacld hmsfmlsh
<dbl> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 CY0 685 4.79 5.11 4.73 4.70 3.37 2.22 3.21
2 1 DE1 332 4.83 5.30 4.82 4.34 4.38 3.98 4.44
3 1 DE2 396 4.59 5.28 4.51 4.22 4.41 3.88 4.41
4 1 DE3 82 5.16 5.40 5.02 5.10 4.52 4.27 4.63
5 1 DE4 72 4.89 5.31 4.65 4.75 4.28 4.07 4.42
6 1 DE6 39 4.95 5.54 5.13 4.76 4.72 4.31 4.67
# ℹ 5 more variables: trstlgl <dbl>, trstplc <dbl>, trstplt <dbl>,
# trstprl <dbl>, trstprt <dbl>
- Perform principle component analysis on the df_regions data. Write the code in chunk ‘pca_chunk’.
# We prepare the numeric data for PCA (excluding region identifiers). Note that we do not use across(), but instead we use dplyr's select and merely exclude the geographic variables that aren't relevant to standardization.
pca_data <- df_regions |>
dplyr::select(-c(regunit, region, n_resp)) |>
scale() # we standardize the data except for regunit, region, n_resp
# PCA:
pca_result <- prcomp(pca_data, center = TRUE, scale. = TRUE)
summary(pca_result)Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6 PC7
Standard deviation 2.5424 1.7840 0.94979 0.74614 0.47482 0.44579 0.40053
Proportion of Variance 0.5386 0.2652 0.07518 0.04639 0.01879 0.01656 0.01337
Cumulative Proportion 0.5386 0.8038 0.87903 0.92542 0.94421 0.96077 0.97414
PC8 PC9 PC10 PC11 PC12
Standard deviation 0.31705 0.29214 0.26692 0.17797 0.1469
Proportion of Variance 0.00838 0.00711 0.00594 0.00264 0.0018
Cumulative Proportion 0.98251 0.98963 0.99556 0.99820 1.0000
# Scree plot:
plot(pca_result, type = "l", main = "Scree Plot of PCA")
# We can look into these tables to get the proportion of variance explained by each component
pve <- summary(pca_result)$importance[2,]
cumulative_pve <- summary(pca_result)$importance[3,]
print(cumulative_pve) PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10
0.53864 0.80385 0.87903 0.92542 0.94421 0.96077 0.97414 0.98251 0.98963 0.99556
PC11 PC12
0.99820 1.00000
# This last line lets us see the proportion of cumulative variance that is accounted for by the PCs.
# We investigate the loadings of the PCA:
loadings <- pca_result$rotation
print(loadings[, 1:2]) # we only show the first 2 components PC1 PC2
eqmgmbg -0.2835970 -0.34760890
eqpaybg -0.2741078 -0.32905964
eqpolbg -0.2942313 -0.30179303
eqwrkbg -0.1918765 -0.40014861
freehms -0.3293767 -0.07515595
hmsacld -0.3371318 -0.07797766
hmsfmlsh -0.3297742 -0.08855127
trstlgl -0.2709066 0.34458487
trstplc -0.2715689 0.22122627
trstplt -0.2764262 0.35018040
trstprl -0.3057706 0.28806356
trstprt -0.2703774 0.35613605
fviz_pca_var(pca_result, col.var = "contrib", gradient.cols = c("blue", "orange", "red"))
# This here is a variable correlation plot, which shows how the variables relate with one another according to the first 2 PCs (which account for 80% of the variation in data). Positively correlated covariates are grouped together and share the same direction, whereas negatively correlated covariates are on opposite quadrants. Covariates at right angles from one another may be said to exhibit "independence" from one another. The length of the arrow (which also happens to be color coded) indicates how well the variable is represented by the principal components (PC1 and PC2)
# The following biplot serves merely as a corroborattion of these results:
fviz_pca_biplot(pca_result, col.var = "contrib", gradient.cols = c("blue", "orange", "red"))
- Discuss and decide how many principal components you wish to retain.
Answer:
Looking at the scree plot, we can see that the “elbow” occurs at around the 3rd or 4th principal component. The hardest bend in the scree plot occurs when the 3rd PC is reached, therefore we could choose to retain the first 3 principal components. Altogether, they account for a cumulative variance of 87.9% in the data.
However, a principal component’s eigenvalue may be interpreted as the standard deviation in data that it can capture. Since the variables have been standardized (and thus have a variance of 1), we should retain the principal components with an eigenvalue greater than 1 and discard the others. The following snippet shows us that only the 2 first principal components yield an eigenvalue greater than 1 (6.46365775 and 3.18254447), the 3rd one falling short of 1 by a little but still significant amount (0.90210543):
eigenvalues <- pca_result$sdev^2
print(eigenvalues) [1] 6.46365775 3.18254447 0.90210543 0.55672683 0.22545451 0.19872569
[7] 0.16042258 0.10051973 0.08534528 0.07124860 0.03167210 0.02157702
We can thus conclusively choose to retain only the first 2 principal components.
- Discuss how you would like to label/interpret the retained principal components.
Answer:
To appropriately interpret the retained principal components, we investigate the loadings for each one of our retained principal components:
loadings <- pca_result$rotation
print(loadings[, 1:2]) PC1 PC2
eqmgmbg -0.2835970 -0.34760890
eqpaybg -0.2741078 -0.32905964
eqpolbg -0.2942313 -0.30179303
eqwrkbg -0.1918765 -0.40014861
freehms -0.3293767 -0.07515595
hmsacld -0.3371318 -0.07797766
hmsfmlsh -0.3297742 -0.08855127
trstlgl -0.2709066 0.34458487
trstplc -0.2715689 0.22122627
trstplt -0.2764262 0.35018040
trstprl -0.3057706 0.28806356
trstprt -0.2703774 0.35613605
print("Loadings for PC1:")[1] "Loadings for PC1:"
sort(loadings[, "PC1"], decreasing = FALSE) # Since for PC1, all loadings have a negative sign hmsacld hmsfmlsh freehms trstprl eqpolbg eqmgmbg trstplt
-0.3371318 -0.3297742 -0.3293767 -0.3057706 -0.2942313 -0.2835970 -0.2764262
eqpaybg trstplc trstlgl trstprt eqwrkbg
-0.2741078 -0.2715689 -0.2709066 -0.2703774 -0.1918765
print("Loadings for PC2:")[1] "Loadings for PC2:"
sort(loadings[, "PC2"], decreasing = TRUE) # Since some loadings have positive signs for PC2 trstprt trstplt trstlgl trstprl trstplc freehms
0.35613605 0.35018040 0.34458487 0.28806356 0.22122627 -0.07515595
hmsacld hmsfmlsh eqpolbg eqpaybg eqmgmbg eqwrkbg
-0.07797766 -0.08855127 -0.30179303 -0.32905964 -0.34760890 -0.40014861
We can see that PC1 yields important loadings for hmsacld, hmsfmlsh, freehms, and trstprl (which are all negative, and thus indicate an inverse relation to PC1).
We can see that PC2 yields important loadings for trstprt, trstplt, trstlgl, trstprl (which are all positive, and thus indicate a normal relation to PC2).
Considering the transformation that occurred at #(2) for the variables freehms and hmsacld, which marked higher scores for sympathy regarding homosexual rights (i.e. right to marriage and right to have children), a negative loading would indicate opposition to that trend. PC1 thus captures a lack of sympathy with regard to homosexual rights, along with a sentiment of shame associated with homosexuality (hmsfmlsh, which wasn’t transformed like the 2 previous variables but also had lower scores for higher levels of shame -> hence the negative loading for hmsfmlsh), and aligns it with trust in parliament (trstprl). Overall, we can argue that PC1 captures a conservative, “traditional” line of thought with a specific focus on the question of homosexual rights in society.
For PC2, the strongest positive loadings indicate trust in the institutions (political parties, politicians, legal system, parliament). PC2 may thus also be associated with a strain of conservatism, but which focuses much less on the issue of homosexuality and which follows a greater alignment with the institutions.
- Do you prefer to have the observed variables standardized before running the pca? Why?
Answer:
Yes, since different scales of magnitude for the covariates may bias the results of our PCA results, or at the very least make their interpretation more complicated. Some variables may have very different variances, so standardization puts them on an equal footing for principal components to capture the variation in data. Standardization is critical in our methodology for answering question (2), where we use the Kaiser criterion to discard principal components with eigenvalues below 1: if our data hadn’t been standardized, this approach would not have been possible.
n_pc<-2 #or any other number of PC's you want to retain
# prcomp.res is the outcome of prcomp-function
df_regions |>
bind_cols(predict(pca_result, newdata = df_regions)[,1:n_pc])->
df_regions_pca1.3 Cluster analysis
- Run k-means & hierarchical cluster analysis. Write the code in chunk ‘cluster_chunk’.
See code below.
- Would you prefer to run both analyses on the retained PC’s or on the original set of observed variables? Argue.
Answer:
Before answering, it is useful to recall that the goal of these clustering procedures is to ensure that the clusters are internally homogeneous (small within variance) and externally heterogeneous (large between variance).
We believe that the core advantage of PCA is that it reveals the most determinant dimensions that shape the data. While retaining but a couple of PC’s obviously entails a reduction in the complexity that is characteristic to the data, it allows for us to reduce noise and only yield insights from the most uncorrelated components that account for most of the data’s variation. Moreover, clustering methods only become more difficult to implement as the dimensionality in the data increase, making distance between points less capable to correctly discriminate groupings. Working on the original data with these methods allows for more noise to inform the clusters, whereas working with the principal components of the data allows for clustering to occur around its most informative structures.
- Would you advice to standardize the observed variables before running the cluster algorithms?
Answer:
Just like with PCA, standardizing the variables is important to accomplish before implementing the clustering routines. This is because our data are expressed on different scales, and these all factor into the formulas which calculate distance between points (e.g. linkage types used in hierarchical clustering). Without standardization, some variables would obviously yield a bias the computations for distance between points.
- Would you advice to standardize the PC’s before running the cluster algorithms?
Answer:
No, since the standardization has already occurred before running the PCA, from which the principal components were yielded. The scale of each of the principal components should normally reflect the true magnitude of one of the uncorrelated dimensions which constitute the data, and are thus informative as such. Standardizing principal components would be ill-advised, since this would dilute the magnitude of some of these components and merely replace information with noise.
- How many clusters of regions do you wish to retain? Argue.
Answer:
We choose to retain 4 clusters. Our motivation lies in the “elbow” we can observe in our graph where we plot the within-cluster sum of squares (within SS) and the between-cluster sum of squares (between SS) against the number of clusters (k). Additional clusters does not translate to substantial increase in between variance (between SS) or to substantial decrease in within variance (within SS)
- Interpret the clusters you retained.
Answer:
Looking at the k-means clustering plot across the PC1-PC2 grid, most clusters seem to aggregate in the lower quadrants, where PC2 assumes negative values.
- Cluster 1 (in red): lower-right quadrant, groups individuals negatively associated with PC2 and positively associated with PC1, namely individuals that are strictly unsupportive of homosexuals’ rights (and show little trust in institutions).
- Cluster 2 (in green): mostly lower-left quadrant, groups individuals associated with eq* variables, indicating support of equal treatment and opportunities for male and female sexes.
- Cluster 3 (in blue): upper-right quadrant, groups individuals positively associated with both PC1 and PC2, namely individuals that are simultaneously unsupportive of homosexuals’ rights and appear to trust legal/political institutions.
- Cluster 4 (in purple): mostly upper-left quadrant, groups individuals associated with trst* variables, indicating trust in the institutions.
The geospatial plot shows the different levels of PC1 across the several regions listed in the dataset. One can clearly see that higher levels of PC1 are more often found in the eastern countries and regions of the EU, whereas the northernmost countries and regions appear to record the lowest levels of PC1.
df_regions_pca |>
full_join(giscoR::gisco_get_nuts(nuts_level = 'all',
resolution="20",
year = 2021),
by=join_by(regunit==LEVL_CODE,region==NUTS_ID)) |>
sf::st_as_sf() |>
sf::st_crop(c(xmin = -12, ymin = -2, xmax = 56, ymax = 71)) |>
group_by(CNTR_CODE) |>
mutate(m = max(regunit * as.numeric(!is.na(n_resp)))) |>
filter((m == 0 & regunit == min(regunit)) | (m > 0 & m == regunit)) |>
ungroup() |>
mutate_if(is.numeric,~round(.,2))->
df_pca_cl
ggplot(df_pca_cl)+
geom_sf(aes(fill = PC1),lwd = 0)+
theme_bw()+
theme(panel.grid.minor = element_blank(),
panel.grid.major = element_blank(),
axis.ticks = element_blank(),
axis.text = element_blank(),
axis.title=element_blank(),
panel.border = element_blank(),
panel.background = element_blank())
# K-MEANS
pca_data <- prcomp(df_regions |> dplyr::select(-c(regunit, region, n_resp)), scale. = TRUE)
df_kmeans <- as.data.frame(pca_data$x[, 1:2])
solution<-data.frame()
for(k in 1:10){
solution_k<-kmeans(df_kmeans,centers = k,nstart = 25)
solution<-rbind(solution,tibble(k=k,
tot.withinss=solution_k$tot.withinss,
betweenss=solution_k$betweenss))
}
# Again, we look for an elbow in the graph. This elbow clearly occurs at k=4. Beyond this amount of clusters, performance of k-means does not improve substantially enough by reducing variability within clusters (or, simultaneously, increasing variability between clusters): it is therefore parsimonious to set k=4.
solution |>
pivot_longer(c(tot.withinss,betweenss),
names_to = "stat",
values_to = "value") |>
ggplot()+
geom_line(aes(x=k,y=value,color=stat))+
scale_x_continuous(breaks=c(1:10))
# Use top 2 PCs for clustering
df_kmeans <- as.data.frame(pca_result$x[, 1:2])
set.seed(1) # We set the randomness seed reproducibility
kmeans_result <- kmeans(df_kmeans, centers = 4, nstart = 25)
df_clusters <- df_regions |>
mutate(kmeans_cluster = factor(kmeans_result$cluster))
# We visualize clustering results:
fviz_cluster(list(data = df_kmeans, cluster = kmeans_result$cluster),
geom = "point", ellipse.type = "convex",
main = "K-means Clustering (PC1 & PC2)",
repel = TRUE)
# Compute distance matrix
distMatrix <- dist(df_kmeans, method = "euclidean")
# Apply hierarchical clustering using Ward linkage
hc <- hclust(distMatrix, method = "ward.D2")
# Cut into 4 clusters
hc_clusters <- cutree(hc, k = 4)
# Add hierarchical cluster assignment
df_clusters$hclust_cluster <- factor(hc_clusters)
# Hierarchical Clustering Dendrogram plot:
plot(hc, labels = df_regions$region,
main = "Hierarchical Clustering Dendrogram",
xlab = "", sub = "")
rect.hclust(hc, k = 4, border = "blue")
n_clus_hier <- 4 # or any other number of retained clusters (hierarchical)
kmeans_result #<- outcome of a kmeans algorithmK-means clustering with 4 clusters of sizes 82, 78, 49, 55
Cluster means:
PC1 PC2
1 -2.6057627 1.095775
2 -0.8952744 -1.373402
3 3.1424567 2.109323
4 2.3549741 -1.565182
Clustering vector:
[1] 4 1 1 2 2 1 1 2 1 1 1 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 4 2 4 1 1 1 3
[38] 1 1 1 1 3 1 1 2 2 1 1 2 2 2 1 2 1 1 1 1 1 1 1 4 3 3 3 4 3 3 3 2 3 3 3 3 2
[75] 4 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[112] 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 4 4 4 4 4 4 4 4 4 4 4 4 4 3 3 4 4 2
[149] 2 2 2 2 4 4 4 4 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 4 4
[186] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 1 3 3 3 3 4 3 3 3 3 4 3 3 3 3 3 3 3
[223] 3 2 3 1 1 1 1 1 1 2 1 1 3 3 3 3 3 3 3 3 3 3 2 2 2 4 4 4 4 4 2 4 2 4 3 3 3
[260] 3 3 3 3 3
Within cluster sum of squares by cluster:
[1] 161.7435 123.7563 138.9997 105.9151
(between_SS / total_SS = 79.1 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault"
df_regions_pca |>
mutate(cl_hierarch = cutree(hc, k = n_clus_hier),
cl_kmeans = kmeans_result$cluster)->
df_regions_pca_cl
df_regions_pca_cl |>
full_join(giscoR::gisco_get_nuts(nuts_level = 'all',
resolution="20",
year = 2021),
by=join_by(regunit==LEVL_CODE,region==NUTS_ID)) |>
sf::st_as_sf() |>
sf::st_crop(c(xmin = -12, ymin = -2, xmax = 56, ymax = 71)) |>
group_by(CNTR_CODE) |>
mutate(m = max(regunit * as.numeric(!is.na(n_resp)))) |>
filter((m == 0 & regunit == min(regunit)) | (m > 0 & m == regunit)) |>
ungroup() |>
mutate_if(is.numeric,~round(.,2))->
df_pca_cl
ggplot(df_pca_cl)+
geom_sf(aes(fill = PC1),lwd = 0)+
theme_bw()+
theme(panel.grid.minor = element_blank(),
panel.grid.major = element_blank(),
axis.ticks = element_blank(),
axis.text = element_blank(),
axis.title=element_blank(),
panel.border = element_blank(),
panel.background = element_blank())
2 Supervized machine learning
- Solitude detection
- This analysis seeks to identify individuals who report to be lonely. This instrument could be used by social organisations to predict people’s risk of being lonely, after completing a short questionnaire.
Download the ‘solitude.zip’ file from the ‘datasets’ chapter and save it in the directory of this file (“Assignment_ML.qmd”).
Variables:
- lonely_obs feeling lonely (0: no, 1:yes)
- male: respondent is mele
- agea: respondents’ age
- mnactic: respondents’ main activity (Retired, PaidWork, Education, Unemployed, sickDisabled, Housework, Other)
- domicil: respondent lives in: bigcity, countryside, suburb, town, village
- hhmmb: Number of people living regularly as member of household
- eduyrs: Years of full-time education completed
- trust_other: trust in other people (0: no trust, 10: a lot of trust)
- health: Subjective general health (1 : very good, 5: very bad)
- netustm: Internet use, how much time on typical day, in minutes
- aesfdrk: Feeling of safety of walking alone in local area after dark (1: Very safe, 4: very unsafe)
- hlthhmp: Hampered in daily activities by illness/disability/infirmity/mental problem (0: no, 1: yes)
- hincfel: Feeling about household’s income nowadays (1: Living comfortably on present income, 4: Very difficult on present income)
unzip("solitude.zip") |>
read_delim(col_names=TRUE,
delim=",",
progress=FALSE,
show_col_types = FALSE,
locale = readr::locale(encoding = "latin1"))->
df_solitude- Describe the step and decisions you took when wrangling data. When there is discussion in your group about the right decision, briefly describe the different (opposing) opinions. Code this steps in the ‘wrangle’-chunck.
Answer:
First and foremost, we had a look at all features of the dataset. There appeared to be no missing values, which conspicuously alleviated our wrangling routines. One thing which caught our eye were the categorical variables: Rather than opting for one-hot-encoding, which leads us to inflate the feature space, we used “factors” instead, which are convenient in R for regressions and are more memory-efficient. Considering that we have 7 possible values for mnactic and 5 possible values for domicil, this was clearly the most parsimonious approach.
We then considered reversing the Likert scales for health, hlthhmp and hincfel, in a way that would be more intuitive to grasp (e.g. 1 = poor health; 5 = excellent health). Leaving these scales as they are, we believe, would have obfuscated their interpretation.
df_solitude |> glimpse()Rows: 37,775
Columns: 13
$ lonely_obs <dbl> 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0…
$ male <dbl> 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1…
$ agea <dbl> 65, 21, 53, 78, 64, 59, 77, 69, 52, 75, 44, 49, 63, 67, 41…
$ mnactic <chr> "Retired", "Education", "PaidWork", "Retired", "Retired", …
$ domicil <chr> "town", "bigcity", "town", "bigcity", "village", "village"…
$ hhmmb <dbl> 2, 1, 3, 1, 2, 2, 1, 2, 3, 2, 1, 2, 2, 1, 4, 2, 2, 1, 1, 2…
$ eduyrs <dbl> 12, 14, 16, 14, 12, 16, 8, 17, 12, 9, 17, 14, 12, 11, 13, …
$ trust_other <dbl> 1.5, 1.1, 2.3, 1.8, 1.7, 2.0, 2.2, 2.4, 2.2, 1.0, 1.6, 1.9…
$ health <dbl> 3, 2, 1, 3, 2, 1, 2, 2, 2, 2, 2, 2, 2, 1, 1, 2, 1, 2, 2, 2…
$ netustm <dbl> 180, 570, 30, 0, 120, 120, 0, 240, 180, 0, 120, 60, 120, 1…
$ aesfdrk <dbl> 2, 3, 3, 3, 1, 2, 3, 2, 2, 2, 2, 2, 1, 2, 1, 1, 1, 3, 1, 1…
$ hlthhmp <dbl> 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ hincfel <dbl> 1, 2, 1, 2, 2, 1, 2, 1, 2, 2, 2, 2, 2, 1, 1, 1, 1, 4, 1, 1…
attach(df_solitude)
# We saw that there were no missing values -> no actions taken
df_solitude$mnactic <- as.factor(df_solitude$mnactic)
df_solitude$domicil <- as.factor(df_solitude$domicil)
# Rather than opting for one-hot-encoding, which leads us to inflate the feature space, we use "factors" instead, which are convenient in R for regressions and are more memory-efficient. Considering that we have 7 possible values for mnactic and 5 possible values for domicil, this was clearly the approach to take.
# We considered reversing the Likert scales for health, hlthhmp and hincfel, which would be more intuitive to grasp (e.g. 1= poor health; 5= excellent health). Leaving these scales as they are, we believe, would obfuscate our interpretation.
df_solitude |> mutate(across(c(health), ~ 6 - .)) |> mutate(across(c(aesfdrk, hincfel), ~ 5 - .))# A tibble: 37,775 × 13
lonely_obs male agea mnactic domicil hhmmb eduyrs trust_other health
<dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl> <dbl> <dbl>
1 0 1 65 Retired town 2 12 1.5 3
2 1 0 21 Education bigcity 1 14 1.1 4
3 0 0 53 PaidWork town 3 16 2.3 5
4 1 0 78 Retired bigcity 1 14 1.8 3
5 0 1 64 Retired village 2 12 1.7 4
6 0 0 59 PaidWork village 2 16 2 5
7 1 0 77 Retired town 1 8 2.2 4
8 0 0 69 Retired village 2 17 2.4 4
9 0 0 52 PaidWork village 3 12 2.2 4
10 0 0 75 Retired village 2 9 1 4
# ℹ 37,765 more rows
# ℹ 4 more variables: netustm <dbl>, aesfdrk <dbl>, hlthhmp <dbl>,
# hincfel <dbl>
- Split, train and assess supervised machine learning algorithms and select the best solution. Describe the different steps and decisions you took. When there is discussion in your group about the right decision, briefly describe the different (opposing) opinions. Code this steps in the ‘wrangle’-chunk.
ANSWER:
One of the first points of debate regarded defining the proportion of training and test sets. Considering that df_solitude consists of 37775 entries, we believe it is safe to split the data with a 30% allocation to test and a 70% allocation to training. If the sample had been smaller, we may have considered adopting a 20% - 80% ratio instead, to minimize the possibility of undertraining.
Then, we considered that a useful understanding the causes of loneliness could be best achieved by implementing a predictive model that could allow for inference on individual covariates, along with a dimension of interpretability. It was thereby agreed to opt for logistic regression, as it provided such elements.
We finally implemented a visualization of the model’s ROC curve, and returned its associated AUC.
set.seed(222) # We set a random seed for the random number generator, for reproducibility of our results
split<-rsample::initial_split(df_solitude,prop = 0.7)
train<-rsample::training(split)
valid<-rsample::testing(split)
m.logistic <- glm(lonely_obs ~ .,
data=train,
family=binomial(link="logit"))
phat<-predict(m.logistic, valid, type = 'response')
confusion <- table(as.numeric(phat>0.5),
valid$lonely_obs)
caret::confusionMatrix(table(as.numeric(predict(m.logistic, valid, type = 'response')>0.5), valid$lonely_obs), positive="1")Confusion Matrix and Statistics
0 1
0 7216 2554
1 581 982
Accuracy : 0.7234
95% CI : (0.715, 0.7316)
No Information Rate : 0.688
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.2398
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.27771
Specificity : 0.92548
Pos Pred Value : 0.62828
Neg Pred Value : 0.73859
Prevalence : 0.31201
Detection Rate : 0.08665
Detection Prevalence : 0.13792
Balanced Accuracy : 0.60160
'Positive' Class : 1
ROCR::prediction(as.numeric(phat),valid$lonely_obs) |>
ROCR::performance("tpr","fpr")->
roc
plotly::plot_ly(type="scatter",
mode="lines",
x=~roc@x.values[[1]],
y=~roc@y.values[[1]],
name='logistic',
text=~paste("<br> fpr (1-specificity)",round(roc@x.values[[1]],2),
"<br> tpr (sensitivity)",round(roc@y.values[[1]],2),
"<br> cutoff",round(roc@alpha.values[[1]],2)),
hoverinfo='text') |>
plotly::add_trace(x=~roc@x.values[[1]],
y=~roc@x.values[[1]],
hoverinfo="none",
name='no model') |>
plotly::layout(xaxis=list(title="1-Specificty: false positive rate: FP / (FP + TN))"),
yaxis=list(title="Recall/Sensitivity: true positive rate: TP / (TP + FN)"))ROCR::prediction(as.numeric(phat),valid$lonely_obs) |>
ROCR::performance("auc")->
auc
print(glue::glue("Logit AUC: {round(auc@y.values[[1]],4)}"))Logit AUC: 0.7086
summary(m.logistic)
Call:
glm(formula = lonely_obs ~ ., family = binomial(link = "logit"),
data = train)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.1878 -0.8457 -0.6200 1.0862 2.6492
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.0100802 0.1312691 0.077 0.938790
male -0.2595557 0.0298305 -8.701 < 2e-16 ***
agea -0.0101532 0.0013389 -7.583 3.38e-14 ***
mnacticHousework -0.3487570 0.0894481 -3.899 9.66e-05 ***
mnacticOther -0.2095318 0.1455798 -1.439 0.150068
mnacticPaidWork -0.3831365 0.0656412 -5.837 5.32e-09 ***
mnacticRetired -0.3476832 0.0875127 -3.973 7.10e-05 ***
mnacticsickDisabled -0.1197898 0.1167502 -1.026 0.304875
mnacticUnemployed 0.0194466 0.0924441 0.210 0.833386
domicilcountryside -0.2321279 0.0746079 -3.111 0.001863 **
domicilsuburb -0.0710817 0.0525136 -1.354 0.175868
domiciltown -0.1574985 0.0406611 -3.873 0.000107 ***
domicilvillage -0.2216005 0.0403117 -5.497 3.86e-08 ***
hhmmb -0.3143725 0.0132093 -23.799 < 2e-16 ***
eduyrs -0.0172084 0.0039650 -4.340 1.42e-05 ***
trust_other -0.3425338 0.0256657 -13.346 < 2e-16 ***
health 0.3224480 0.0201950 15.967 < 2e-16 ***
netustm 0.0001160 0.0000901 1.288 0.197816
aesfdrk 0.1567689 0.0196077 7.995 1.29e-15 ***
hlthhmp 0.1915295 0.0375544 5.100 3.40e-07 ***
hincfel 0.3938725 0.0193462 20.359 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 32871 on 26441 degrees of freedom
Residual deviance: 29557 on 26421 degrees of freedom
AIC: 29599
Number of Fisher Scoring iterations: 4
Our logistic regression has yielded an AUC of 0.7086 (area under the ROC curve), which is reasonably decent, or at the very least acceptable for a predictive model. We can also show interesting results when it comes to individual coefficients, which are interpreted as impacting the log odds of yielding a “1” value for lonely_obs.
Almost all coefficients appeared to be statistically significant. The strongest impacting factors of loneliness include (in decreasing orders of magnitude, and all very statistically significant): - Living comfortably on present household income (-> hincfel) - not being retired or not working a job (-> mnacticRetired, mnacticPaidWork) - feeling healthier subjectively (-> hlthhmp) - having less people around in one’s household (-> hhmmb) - investing more trust in others (-> trust_other) - not being a man (-> male)
As a final additional effort of investigation, we considered applying a different link function (which would substitute logit): since there are much less “1” observations for the dependent variable lonely_obs than “0” observations (31.28% of “1” values), we believed that this asymmetry in the distribution of outcome values could best be fitted with a sigmoid function that is much more sensible to covariates as the outcome approaches “1”. Such a sigmoid function is found in the cloglog (complementary log-log) link function.
m.cloglog <- glm(lonely_obs ~ .,
data=train,
family=binomial(link="cloglog"))
phat_cloglog<-predict(m.cloglog, valid, type = 'response')
confusion_cloglog <- table(as.numeric(phat_cloglog>0.5),
valid$lonely_obs)
caret::confusionMatrix(table(as.numeric(predict(m.cloglog, valid, type = 'response')>0.5), valid$lonely_obs), positive="1")Confusion Matrix and Statistics
0 1
0 7270 2599
1 527 937
Accuracy : 0.7242
95% CI : (0.7158, 0.7324)
No Information Rate : 0.688
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.235
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.26499
Specificity : 0.93241
Pos Pred Value : 0.64003
Neg Pred Value : 0.73665
Prevalence : 0.31201
Detection Rate : 0.08268
Detection Prevalence : 0.12918
Balanced Accuracy : 0.59870
'Positive' Class : 1
ROCR::prediction(as.numeric(phat_cloglog),valid$lonely_obs) |>
ROCR::performance("tpr","fpr")->
roc
plotly::plot_ly(type="scatter",
mode="lines",
x=~roc@x.values[[1]],
y=~roc@y.values[[1]],
name='cloglog logistic',
text=~paste("<br> fpr (1-specificity)",round(roc@x.values[[1]],2),
"<br> tpr (sensitivity)",round(roc@y.values[[1]],2),
"<br> cutoff",round(roc@alpha.values[[1]],2)),
hoverinfo='text') |>
plotly::add_trace(x=~roc@x.values[[1]],
y=~roc@x.values[[1]],
hoverinfo="none",
name='no model') |>
plotly::layout(xaxis=list(title="1-Specificty: false positive rate: FP / (FP + TN))"),
yaxis=list(title="Recall/Sensitivity: true positive rate: TP / (TP + FN)"))ROCR::prediction(as.numeric(phat_cloglog),valid$lonely_obs) |>
ROCR::performance("auc")->
auc
print(glue::glue("Cloglog AUC: {round(auc@y.values[[1]],4)}"))Cloglog AUC: 0.7087
The results of this attempt do not provide significant improvements in prediction. The AUC has remained the same (only increasing by a very minute amount), and what’s more frustrating about the cloglog link function is the lack of ease when it comes to the interpretability of the model coefficients. We thus retain our conclusions from the model which used a logit link function.